Workflow
I’m describing here the most important steps in the analysis.
Moreover, it is possible to find all parameters and figures at the .rmd
file.
Inspect read
quality profiles
Quality profiles of the forward reads:
In gray-scale is a heat map of the frequency of each quality score at
each base position. The median quality score at each position is shown
by the green line, and the quartiles of the quality score distribution
by the orange lines. The red line shows the scaled proportion of reads
that extend to at least that position
The forward reads are good quality I truncated the forward reads at
position 240.
Quality profile of the reverse reads:
The reverse reads are of significantly worse quality, especially at
the end, which is common in Illumina sequencing. This isn’t too
worrisome, as DADA2 incorporates quality information into its error
model which makes the algorithm robust to lower quality sequence, but
trimming as the average qualities crash will improve the algorithm’s
sensitivity to rare sequence variants. Based on these profiles, I
truncated the reverse reads at position 200 where the quality
distribution crashes.
Identify
primers
The universal primer set 5.8S and ITS1F (Fierer et al., 2005; Yarwood
et al., 2010) primers were used to amplify this dataset. We record the
DNA sequences, including ambiguous nucleotides, for those primers.
FWD <- “CTTGGTCATTTAGAGGAAGTAA”
REV <- “GCTGCGTTCTTCATCGATGC”
Sanity check of primers and adapters
## Forward Complement Reverse RevComp
## FWD.ForwardReads 0 0 0 0
## FWD.ReverseReads 0 0 0 0
## REV.ForwardReads 0 0 0 0
## REV.ReverseReads 0 0 0 0
Success! Primers are no longer detected in the cutadapted
reads.
The primer-free sequence files are now ready to be analyzed through
the DADA2 pipeline.
Learn the Error
Rates
The DADA2 method relies on a parameterized model of substitution
errors to distinguish sequencing errors from real biological variation.
Because error rates can (and often do) vary substantially between
sequencing runs and PCR protocols, the model parameters can be
discovered from the data itself using a form of unsupervised learning in
which sample inference is alternated with parameter estimation until
both are jointly consistent.
In order to verify that the error rates have been reasonably
well-estimated, we inspect the fit between the observed error rates
(black points) and the fitted error rates (black lines) in the next
Figure:
Everything looks reasonable and we proceed with confidence.
After filtering, we moved to the next step:
Sequence inference: The DADA2 sequence inference step removed -
nearly - all substitution and indel errors from the data. We now merge
together the inferred forward and reverse sequences, removing paired
sequences that do not perfectly overlap as a final control against
residual errors.
Construct sequence
table and remove chimeras
The DADA2 method produces a sequence table valued by the number of
times each sequence was observed in each sample.
Notably, chimeras have not yet been removed. The error model in the
sequence inference algorithm does not include a chimera component, and
therefore we expect this sequence table to include many chimeric
sequences. We now remove chimeric sequences by comparing each inferred
sequence to the others in the table, and removing those that can be
reproduced by stitching together two more abundant sequences.
The frequency of chimeric sequences varies substantially from dataset
to dataset, and depends on on factors including experimental procedures
and sample complexity.
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.950831
Inspect
distribution of sequence lengths:
As expected, quite a bit of length variability in the the amplified
ITS region.
Track reads through
the pipeline
We now inspect the the number of reads that made it through each step
in the pipeline to verify everything worked as expected.
Looks good! We kept the majority of our raw reads. The most impacting
was the filter step.
Taxa prevalence
versus total counts.
Each point is a different taxa. Exploration of the data in this way
is often useful for selecting filtering parameters, like the minimum
prevalence criteria we will used to filter the data above.
In this study, I removed taxa not seen more than 2 times in at least
1% of the samples. This protects against an ASV with small mean &
trivially large C.V.
Count prevalence
- Kingdom = 1
– Phylum = 15
— Class = 53
— Order = 128
—- Family = 232
—– Genus = 458
—— ASVs = 551
Replicates
Evaluation
In this step an evaluation of the replicates has been performed. The
main goal was to understand if they are any significant difference
between the replicates A and B.
Alpha Diversity Plot comparing the Replicates A and
B.
Quality control analysis using matched samples from 3 different
Transects: A, B and C of the experiment and replicates samples on each
Ecotype. Comparison of alpha diversity in technical replicates samples
on all Ecotypes from each Transect. ASV richness and ASV Shannon
diversity.
Evaluation of alpha diversity: looking for difference at the count
read distribution
Beta Diversity Plot comparing the Replicates A and
B.
Quality control analysis using matched samples from replicates A and
B. Beta diversity using Jensen-Shannon distance
Beta Diversity Plot comparing the Replicates A and
B.
In summary: there is no difference between A and B samples. In this
way we choose working the replicate A.
Rarefaction
Main Analysis
- Alpha Diversity Plot - A.
Raw alpha-diversity, i.e. without rarefying
comparing alpha diversity based on raw data has the huge problem
of ignoring sample size differences. Usually there is a trend that more
reads correlate with higher diversity.
Here, therefore I checked, whether there are significant sample
size differences between the groups.
We further look at the residuals of a linear fit alpha diversity
vs sample size to correct for this confounder.
Check whether sample_sums/sample sizes/library sizes differ
between groups
sample size adjustment has no influence on richness, so
sample_sizes are compared for raw counts
Alpha Diversity = BoxPlot
summary(aov(df2$Observed ~ Transect * Ecotype, data = df2))
## Df Sum Sq Mean Sq F value Pr(>F)
## Transect 2 1715 858 0.733 0.488718
## Ecotype 4 151580 37895 32.400 0.00000000017 ***
## Transect:Ecotype 8 51828 6478 5.539 0.000243 ***
## Residuals 30 35087 1170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Plot by mean
Ecotype
- Permutational analysis of variance
HeatMap
- Relative abundance of the top 20 ASVs
Summary
Help is needed with running the “nfcore/ampliseq” pipeline developed
at NGI for the analyses of fungal ITS1 amplicons, Illumina Miseq
analysis NGI project ID P5953 (M.Hogberg_17_01_project summary from
2018-01-19 by Chuan Wang refers to P9723, >=Q30 (mean(SD), 70(2) (%),
Sum Reads=15 650 000, Mean reads per sample 171 711, 1 pool of
amplicons, 1 flowcell v3, PE 2x301bp (validated method), demultiplexing,
quality control and raw data delivery on Uppmax (validated method).
Agreement number M.Hogberg_16_01-20160826. Grus delivery project
delivery 00654. Because my support application 2019-01-17 was rejected,
I have collaborated with partners in US on this matter but all is
extremely delayed for known reasons. I got some hope today when I read
about the recently developed pipeline for fungal analyses!
Unfortunately, I have no programming skills but have a BSc in Molecular
Biology.
Short summary of the work.
Support
Completion
You should soon be contacted by one of our managers with a request to
close down the project in our internal system and for invoicing matters.
If we do not hear from you within 30 days the project will be
automatically closed and invoice sent. Again, we would like to remind
you about data responsibility and acknowledgements, see sections:
Data Responsibility and
Acknowledgments.
You are welcome to come back to us with further data analysis request
at any time via http://nbis.se/support/support.html.
Thank you for using NBIS.